What is claimed is:

A computer implemented method for determining the direction of a consensus 
sequence of a cluster of sequences with contradictions in directions comprising: 

determining the probability (b) that the contradictions are explained by 
random errors according to a statistical model and the weighted number of
contradictory sequences in the cluster; and 

defining the direction of majority of the sequences as the direction of the 
consensus sequence if the probability is the same as or greater than a threshold 
value (T) and x ≤ n/2.

The method of Claim 1 wherein the statistical model is a binomial distribution
and the probability is calculated as follows:

$$ b(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x} $$

wherein n is the weighted number of the sequences in the cluster; p is the
probability of random errors resulting in the contradictions; and x is the number of
the contradictory sequences.

The method of Claim 2 wherein CDS and mRNA sequences carry a higher weight
than 5' EST or 3' EST; directionless EST carries a weight of 0.

The method of Claim 2 wherein the weights to different types of sequences are the
same.

The method of Claim 2 wherein the threshold value is around 0.001.

The method of Claim 2 wherein the threshold value is around 0.002.

The method of Claim 2 wherein the threshold value is around 0.003.

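As an illustration only, the binomial probability recited in Claim 2, b(x; n, p) = n!/(x!(n-x)!) p^x (1-p)^(n-x), and its comparison against a threshold of the kind recited in Claims 5 to 7 might be sketched as follows. The function names are hypothetical, and the sketch assumes that the weighted counts n and x have already been reduced to non-negative integers.

```python
from math import comb


def binomial_probability(x: int, n: int, p: float) -> float:
    """Return b(x; n, p): the probability of exactly x contradictory
    sequences among n sequences when each is wrong with probability p."""
    if not 0 <= x <= n:
        raise ValueError("x must satisfy 0 <= x <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)


def explained_by_random_error(x: int, n: int, p: float, threshold: float) -> bool:
    """True when b(x; n, p) is the same as or greater than the threshold."""
    return binomial_probability(x, n, p) >= threshold
```

With p = 0.06 and a threshold of 0.001, one contradictory sequence in a cluster of 20 passes the test, while eight contradictory sequences in a cluster of 20 do not.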
The method of Claim 2 further comprising defining the direction of majority of
the sequences as the direction of the consensus sequence if the probability is lower 
than the threshold value and x < n*P.

The method of Claim 8 further comprising further subclustering for the minority 
direction and majority direction if the probability is smaller than the threshold 
value and x>n*P. 
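The three outcomes recited in Claims 1, 8 and 9 can be combined into a single decision routine. The following is a minimal sketch, not a definitive implementation: the function name and return labels are hypothetical, the default values p = 0.06 and threshold = 0.001 are taken from the ranges recited in the dependent claims, and the first condition is read as "x is at most n/2", which is an assumption. Cases the claims do not address are reported as "undetermined".

```python
from math import comb


def decide_direction(x: int, n: int, p: float = 0.06, threshold: float = 0.001) -> str:
    """Classify a cluster with x contradictory sequences out of n.

    Returns "majority" when the majority direction is adopted,
    "subcluster" when further subclustering is indicated, and
    "undetermined" for combinations the claims leave open.
    """
    if not 0 <= x <= n:
        raise ValueError("x must satisfy 0 <= x <= n")
    prob = comb(n, x) * p**x * (1.0 - p) ** (n - x)  # b(x; n, p)

    if prob >= threshold and x <= n / 2:
        return "majority"      # contradictions plausibly random errors
    if prob < threshold and x < n * p:
        return "majority"      # fewer contradictions than expected
    if prob < threshold and x > n * p:
        return "subcluster"    # more contradictions than random error explains
    return "undetermined"
```

For example, a cluster of 20 sequences with one contradiction yields "majority", whereas a cluster of 20 with eight contradictions yields "subcluster".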

The method of Claim 9 wherein the p is between 0.03-0.10. 

The method of Claim 10 wherein the p is around 0.06. 

The method of Claim 11 wherein the p is determined according to binomial
frequency distribution of contradictory sequences in a plurality of clusters or 
subclusters of sequences. 
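The claim does not state how p is derived from the frequency distribution. One plausible reading, offered here purely as an assumption, is a pooled maximum-likelihood estimate: the total number of contradictory sequences divided by the total number of sequences over all clusters. The function name and the (x, n) input layout are hypothetical.

```python
from typing import Iterable, Tuple


def estimate_error_rate(clusters: Iterable[Tuple[int, int]]) -> float:
    """Pooled binomial estimate of p from (x, n) pairs, where x is the
    number of contradictory sequences and n the cluster size."""
    pairs = list(clusters)  # allow generators to be consumed once
    total_x = sum(x for x, _ in pairs)
    total_n = sum(n for _, n in pairs)
    if total_n == 0:
        raise ValueError("at least one non-empty cluster is required")
    if any(x < 0 or x > n for x, n in pairs):
        raise ValueError("each pair must satisfy 0 <= x <= n")
    return total_x / total_n
```

Three clusters with (x, n) of (1, 20), (0, 15) and (2, 15) give an estimate of 3/50 = 0.06, which falls inside the 0.03 to 0.10 range recited above.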

A method of selecting sequences for designing a probe array comprising: 

cleaning raw sequences; 

refining clusters of the raw sequences; and 

generating candidate design sequences, wherein the candidate design sequences 
are exemplar or consensus sequences of the clusters. 

The method of Claim 13 wherein the cleaning comprises:
removing withdrawn sequences;
screening, filtering, and masking raw sequences; and
trimming terminal ambiguous sequence regions.
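Two of the cleaning steps above, removing withdrawn sequences and trimming terminal ambiguous regions, lend themselves to a short sketch. The names below are hypothetical, ambiguity is assumed to be encoded as runs of "N", and the screening, filtering, and masking step is deliberately omitted because the claim does not specify it.

```python
from typing import Dict, Set


def trim_terminal_ambiguity(seq: str, ambiguous: str = "N") -> str:
    """Strip runs of the ambiguous base from both ends of a sequence.
    Internal ambiguous bases are left untouched."""
    return seq.upper().strip(ambiguous)


def clean_sequences(records: Dict[str, str], withdrawn_ids: Set[str]) -> Dict[str, str]:
    """Drop withdrawn records, trim the rest, and discard sequences
    that become empty after trimming."""
    cleaned = {}
    for seq_id, seq in records.items():
        if seq_id in withdrawn_ids:
            continue
        trimmed = trim_terminal_ambiguity(seq)
        if trimmed:
            cleaned[seq_id] = trimmed
    return cleaned
```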

The method of Claim 13 wherein the refining includes two-level clustering.



The method of Claim 13 wherein the generating comprises: 
selecting exemplary sequences. 

The method of Claim 16 wherein the generating comprises: 

generating alignments of sequences within clusters; 

calling consensus sequence bases according to consensus calling rules; and 

determining consensus sequence direction. 
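The consensus calling rules are not specified in the claim. As a hedged illustration, the sketch below assumes a simple per-column majority vote over pre-aligned sequences of equal length, with "-" as the gap character and "N" emitted on ties; these conventions and the function name are assumptions rather than the claimed rules.

```python
from collections import Counter
from typing import Sequence


def call_consensus(aligned: Sequence[str], gap: str = "-", ambiguous: str = "N") -> str:
    """Call one consensus base per alignment column by majority vote."""
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column if base != gap)
        if not counts:
            continue  # a column of gaps contributes no base
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append(ambiguous)  # tie between the top two bases
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)
```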

The method of Claim 17 wherein the determining comprises defining the direction 
of sequences in the clusters as the consensus sequence direction if there are no
contradictory sequence directions.

The method of Claim 18 wherein the determining further comprises:

determining the probability (b) that the contradictions are explained by
random errors according to a statistical model and the weighted number of
contradictory sequences in the cluster; and

defining the direction of majority of the sequences as the direction of the
consensus sequence if the probability is the same as or greater than a threshold
value (T) and x ≤ n/2.

The method of Claim 19 wherein the statistical model is a binomial distribution
and the probability is calculated as follows:

$$ b(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x} $$

wherein n is the weighted number of the sequences in the cluster; p is the 
probability of random errors resulting in the contradictions; and x is the number of 
the contradictory sequences. 

The method of Claim 20 wherein CDS and mRNA sequences carry a higher
weight than 5' EST or 3' EST; directionless EST carries a weight of 0.
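A weighting scheme of this kind can be illustrated with a small tally. The specific weight values below (2 for CDS and mRNA, 1 for 5' and 3' EST, 0 for directionless EST) are hypothetical; the claim requires only that CDS and mRNA outweigh ESTs and that directionless ESTs count for nothing. The sketch also assumes that the same weights apply to both n and x, and that the minority direction supplies the contradictory count.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical weights; only their ordering is taken from the claim.
WEIGHTS = {
    "CDS": 2,
    "mRNA": 2,
    "5' EST": 1,
    "3' EST": 1,
    "directionless EST": 0,
}


def weighted_counts(sequences: Iterable[Tuple[str, Optional[str]]]) -> Tuple[int, int]:
    """Return (n, x): the weighted number of sequences and the weighted
    number in the minority (contradictory) direction.

    Each item is (sequence_type, direction) with direction "+", "-" or None.
    """
    totals = {"+": 0, "-": 0}
    for seq_type, direction in sequences:
        weight = WEIGHTS[seq_type]
        if direction in totals:
            totals[direction] += weight
    n = totals["+"] + totals["-"]
    x = min(totals.values())
    return n, x
```

A cluster holding one forward mRNA, one forward 5' EST, one reverse 3' EST and one directionless EST gives n = 4 and x = 1 under these assumed weights.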

The method of Claim 21 wherein the weights to different types of sequences are
the same. 

The method of Claim 22 wherein the threshold value is around 0.001. 

The method of Claim 22 wherein the threshold value is around 0.002. 

The method of Claim 22 wherein the threshold value is around 0.003. 

The method of Claim 22 further comprising defining the direction of majority of 
the sequences as the direction of the consensus sequence if the probability is lower 
than the threshold value and x < n*P.

The method of Claim 26 further comprising further subclustering for the minority 
direction and majority direction if the probability is smaller than the threshold 
value and x>n*P. 

The method of Claim 27 wherein the p is between 0.03-0.10. 

The method of Claim 27 wherein the p is around 0.06. 

The method of Claim 27 wherein the p is determined according to binomial 
frequency distribution of contradictory sequences in a plurality of clusters or 
subclusters of sequences. 